ABSTRACT


Considerable success has been achieved in the development of speaker dependent speech recognition systems. The current focus is on the development of speaker independent speech recognition systems that are robust to noisy environments. Obtaining the best parametric representation of the speech signal, one that is robust to both of the above conditions, is an important task in designing a speech recognition system. This thesis compares the performance of three front end processors, based on LPC, the mel scale, and the scale transform, in vowel recognition and isolated digit recognition. The performance is compared with respect to interspeaker variations and noise. For vowels the data is generated from the TIMIT database, and for digits it is obtained from the Oregon Graduate Institute digit database. The isolated digit recognition system is based on a vector quantization scheme in which multiple reference patterns are used to represent each digit. Some earlier studies demonstrated that mel scale based front end processors perform better than LPC models and are almost comparable to auditory model based front end processors. The results in this thesis show that the scale transform based front end processors significantly outperform the LPC and the mel scale models under interspeaker variations and noisy conditions.

1 INTRODUCTION


Automatic recognition of speech by machines has been a goal of research for more than four decades and has inspired such science fiction wonders as the robot R2D2 in the George Lucas classic Star Wars series of movies. By automatic speech recognition we mean a system which takes as input the acoustic waveform produced by the speaker and produces as output a sequence of linguistic words corresponding to the input utterance.

However, in spite of the glamour of designing an intelligent machine that can recognize the spoken word and comprehend its meaning, and in spite of the enormous research effort spent in trying to create such a machine, we are far from achieving the desired goal of a machine that can understand spoken words on any subject, by all speakers, in all environments.

At present we have acceptable speaker dependent speech recognition systems, and research is now focused on the development of speaker independent systems. By a speaker independent speech recognition system we mean a system which is robust to inter-speaker variations.

There are two main problems for robust speech recognition:
1. Differences in vocal tract size among individual speakers contribute to the variability of speech waveforms. The first-order effect of a difference in vocal tract size is a scaling of the frequency axis: a female speaker, for example, exhibits formants roughly 20% higher than the formants of a male speaker, with the differences most severe in open vocal tract configurations.
2. The inherent mismatch between training and test environments is another problem. Speech recognition in adverse conditions has to deal with the effects of different noise levels combined with the influence of different recording channels.

Speech recognition has two main approaches:

1. The pattern recognition approach. The pattern recognition paradigm has four steps, namely:

1.1 Feature measurement, in which a sequence of measurements is made on the input signal to define the pattern.

1.2 Pattern training, in which one or more patterns corresponding to speech sounds of the same class are used to create a reference pattern of that class.

1.3 Pattern classification, in which the unknown test pattern is compared with each class reference pattern and a measure of similarity between the test pattern and each reference pattern is computed.

1.4 Decision logic, in which the overall pattern similarity scores are used to decide which reference pattern best matches the unknown test pattern.

2. The Hidden Markov Model (HMM) approach. The assumption of this approach is that the speech signal can be well characterized as a parametric random process, and that the parameters of the stochastic process can be determined or estimated in a precise, well-defined manner. There are three fundamental problems for HMM design:

2.1 The evaluation of the probability of a sequence of observations given a specific HMM.

2.2 The determination of the best sequence of model states.

2.3 The adjustment of model parameters so as to best account for the observed signal.
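The four steps of the pattern recognition approach listed above can be sketched in a few lines. This is only an illustration under stated assumptions: the two-dimensional feature vectors, the class labels, the use of a class mean as the reference pattern, and the Euclidean distance are all hypothetical choices and are not taken from the experiments in this report.

```python
import math

def euclidean(x, y):
    """Similarity measure: Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def train_reference(patterns):
    """Pattern training: average the training vectors of one class."""
    n = len(patterns)
    return [sum(column) / n for column in zip(*patterns)]

def classify(test, references):
    """Pattern classification and decision logic: pick the closest reference."""
    scores = {label: euclidean(test, ref) for label, ref in references.items()}
    return min(scores, key=scores.get), scores

# Hypothetical two-dimensional "features" for two hypothetical classes.
training = {
    "aa": [[1.0, 2.0], [1.2, 1.8]],
    "iy": [[4.0, 0.5], [3.8, 0.7]],
}
references = {label: train_reference(p) for label, p in training.items()}
label, scores = classify([1.1, 2.1], references)
```

In a real recognizer the feature measurement step would be one of the front end processors compared in this report, and several reference patterns per class may be kept instead of a single mean.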

Whatever the approach used for speech recognition, perhaps the greatest common denominator of all speech recognition systems is the signal processing front end, which converts the speech signal to some type of parametric representation. The selection of the best parametric representation of acoustic data is an important task in the design of any speech recognition system. The usual objectives in selecting a representation are to compress the speech data by eliminating information not pertinent to the phonetic analysis of the data, and to enhance those aspects of the signal that contribute significantly to the detection of phonetic differences.

A wide range of possibilities exists for parametrically representing the speech signal. These include the short time energy, zero crossing rate, level crossing rate, and other related parameters. Probably the most important parametric representation is the short time spectral envelope. Spectral analysis methods are therefore generally considered the core of the signal processing front end in a speech recognition system.

Speech recognizers frequently use the linear predictive coding (LPC) spectral analysis model or the filter bank spectral analysis model as the front end processor. Recently there has been considerable work and interest in using mel cepstral coefficients as acoustic features. This is primarily due to the work of Davis and Mermelstein [3], who compared a number of parametric representations and found improved recognition performance using the mel based cepstrum. A recent article by Jankowski et al. [4] compares the mel cepstrum with linear prediction and auditory model based front ends. They conclude that mel cepstral based front ends significantly outperform LP features and are almost comparable to auditory model based front ends. The motivation for using the mel frequency based cepstrum stems from psychoacoustic experiments done to study the human auditory system. Recently a scale cepstrum based front end processor [5] has been proposed as an alternative to both mel cepstral and LPC models for speech spectral analysis. The scale transform based cepstrum is motivated by speaker normalization techniques, which are necessary since different speakers have different formant frequencies for the same vowel [5].

In this report we compare the performance of the different signal processing front ends when used for recognition of vowels and isolated digits. The report is organized as follows. Chapter II describes the basic signal processing front ends. In this chapter some important properties related to linear prediction, filterbanks, and the scale transform are also reviewed. Chapters III and IV describe the experimental framework and the procedures for selection of speech data to compare the recognition accuracies of the front ends in clean and noisy environments. Finally, in Chapter V the results obtained with the various representations are listed and discussed from the point of view of robustness of the front ends with respect to inter-speaker variations and noise.



2 THE FRONT END PROCESSORS 


In this chapter we discuss the fundamental properties of linear prediction, filterbanks, and the scale transform. A complete description of the front end processors is also given.


2.1 LPC FRONT END PROCESSOR

The theory of linear prediction as applied to speech has been well understood for many years. In this section we describe the basics of how LPC has been applied in speech recognition systems. The mathematical details and derivations are omitted here; the interested reader is referred to [1] of the Bibliography.

2.1.1 The LPC model

The basic idea behind the LPC model is that a given speech signal at time $n$, $s(n)$, can be approximated as a linear combination of the past $p$ samples, such that

\[
s(n) \approx a_1 s(n-1) + a_2 s(n-2) + \cdots + a_p s(n-p) \tag{2.1.1}
\]

where the coefficients $a_1, a_2, \ldots, a_p$ are assumed constant over the speech analysis frame.

We convert Eq. (2.1.1) to an equality by including an excitation term $G u(n)$, giving

\[
s(n) = \sum_{i=1}^{p} a_i s(n-i) + G u(n) \tag{2.1.2}
\]

where $u(n)$ is a normalized excitation and $G$ is the gain of the excitation. By expressing Eq. (2.1.2) in the z-domain we get the relation

\[
S(z) = \sum_{i=1}^{p} a_i z^{-i} S(z) + G U(z) \tag{2.1.3}
\]

leading to the transfer function

\[
H(z) = \frac{S(z)}{G U(z)} = \frac{1}{1 - \sum_{i=1}^{p} a_i z^{-i}} = \frac{1}{A(z)} \tag{2.1.4}
\]

The interpretation of Eq. (2.1.4) is given in Figure 2.1, which shows the normalized excitation source $u(n)$ being scaled by the gain $G$ and acting as input to the all-pole system $H(z) = 1/A(z)$ to produce the speech signal $s(n)$.

Figure 2.1 LPC model for speech
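As an illustration of the all-pole model, the following minimal sketch runs the recursion $s(n) = \sum_{i=1}^{p} a_i s(n-i) + G u(n)$ directly. The function name `synthesize`, the single coefficient 0.5, the gain 2.0, and the unit-impulse excitation are illustrative assumptions only; zero initial conditions are assumed as well.

```python
def synthesize(a, gain, excitation):
    """All-pole synthesis: s(n) = sum_i a[i-1]*s(n-i) + gain*u(n).

    Samples before n = 0 are taken to be zero (an assumption of this sketch).
    """
    p = len(a)
    s = []
    for n, u in enumerate(excitation):
        acc = gain * u
        for i in range(1, p + 1):
            if n - i >= 0:
                acc += a[i - 1] * s[n - i]
        s.append(acc)
    return s

# Unit impulse through a first-order model (p = 1, a_1 = 0.5, G = 2).
impulse = [1.0] + [0.0] * 7
s = synthesize([0.5], 2.0, impulse)
```

With these values the output is the decaying sequence 2, 1, 0.5, 0.25, ..., which is the impulse response of $G/A(z)$ for $A(z) = 1 - 0.5 z^{-1}$.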


2.1.2 LPC Analysis Equations

Based on the model of Figure 2.1, the exact relation between $s(n)$ and $u(n)$ is

\[
s(n) = \sum_{k=1}^{p} a_k s(n-k) + G u(n) \tag{2.1.5}
\]

We consider the linear combination of past speech samples as the estimate $\tilde{s}(n)$, defined as

\[
\tilde{s}(n) = \sum_{k=1}^{p} a_k s(n-k) \tag{2.1.6}
\]

We now form the prediction error $e(n)$, defined as

\[
e(n) = s(n) - \tilde{s}(n) = s(n) - \sum_{k=1}^{p} a_k s(n-k) \tag{2.1.7}
\]

with error transfer function

\[
A(z) = \frac{E(z)}{S(z)} = 1 - \sum_{k=1}^{p} a_k z^{-k} \tag{2.1.8}
\]

The basic problem of linear prediction analysis is to determine the set of predictor coefficients $\{a_k\}$ directly from the speech signal so that the spectral properties of the digital filter match those of the speech waveform within the analysis window. The basic approach is to find a set of predictor coefficients that minimize the mean squared prediction error over a short segment of speech.

To set up the equations that must be solved to determine the predictor coefficients, we define the short term speech and error segments at time $n$ as

\[
s_n(m) = s(n+m), \qquad e_n(m) = e(n+m) \tag{2.1.9}
\]

and we seek to minimize the mean squared error at time $n$,

\[
E_n = \sum_{m} e_n^2(m) \tag{2.1.10}
\]

which, using the definition of $e_n(m)$ in terms of $s_n(m)$, can be written as

\[
E_n = \sum_{m} \left[ s_n(m) - \sum_{k=1}^{p} a_k s_n(m-k) \right]^2 \tag{2.1.11}
\]

To solve Eq. (2.1.11) for the predictor coefficients, we differentiate $E_n$ with respect to each $a_k$ and set the result to zero,

\[
\frac{\partial E_n}{\partial a_k} = 0, \qquad k = 1, 2, \ldots, p \tag{2.1.12}
\]

giving

\[
\phi_n(i, 0) = \sum_{k=1}^{p} a_k \phi_n(i, k) \tag{2.1.13}
\]

where

\[
\phi_n(i, k) = \sum_{m} s_n(m-i) s_n(m-k) \tag{2.1.14}
\]

is the short term covariance of $s_n(m)$.

To solve Eq. (2.1.13) for the optimum predictor coefficients, we have to compute $\phi_n(i, k)$ for $1 \le i \le p$ and $0 \le k \le p$ and then solve the resulting set of $p$ equations. In practice the method of solving the equations is a strong function of the range of $m$ used in defining both the section of speech for analysis and the region over which the mean squared error is computed. There are two standard methods of defining this range: the autocorrelation method and the covariance method. The covariance method is generally not used in speech recognition systems. Hence we will not discuss this method, but instead concentrate on the autocorrelation method of LPC analysis for the remainder of the chapter. For the covariance method the reader may refer to [1].


2.1.3 The Autocorrelation method


A fairly simple and straightforward way of defining the limits on $m$ is to assume that the speech segment is identically zero outside the interval $0 \le m \le N-1$. This is equivalent to assuming that the speech signal $s(n+m)$ is multiplied by a finite window $w(m)$ which is identically zero outside the range $0 \le m \le N-1$. Thus the speech sample for minimization can be expressed as

\[
s_n(m) =
\begin{cases}
s(m+n)\, w(m), & 0 \le m \le N-1 \\
0, & \text{otherwise}
\end{cases}
\tag{2.1.15}
\]

The effect of weighting the speech by a window is illustrated in Figure 2.2. In this figure the upper panel shows the speech waveform, the middle panel shows the weighted section of speech, and the bottom panel shows the error signal based on the optimum selection of predictor coefficients.

Figure 2.2 The error signal for autocorrelation method


Based on Eq. (2.1.15), for $m < 0$ the error signal is exactly zero since $s_n(m) = 0$ for all $m < 0$, and therefore there is no prediction error. Furthermore, for $m > N-1+p$ there is no prediction error because $s_n(m) = 0$ for all $m > N-1$. Eq. (2.1.10) now becomes

\[
E_n = \sum_{m=0}^{N-1+p} e_n^2(m) \tag{2.1.16}
\]

and the covariance function can be written as

\[
\phi_n(i, k) = \sum_{m=0}^{N-1-(i-k)} s_n(m)\, s_n(m+i-k), \qquad 1 \le i \le p, \quad 0 \le k \le p \tag{2.1.17}
\]

Since Eq. (2.1.17) is only a function of $i-k$, the covariance function $\phi_n(i, k)$ reduces to the simple autocorrelation function, i.e.,

\[
\phi_n(i, k) = r_n(i-k) = \sum_{m=0}^{N-1-(i-k)} s_n(m)\, s_n(m+i-k) \tag{2.1.18}
\]

Since the autocorrelation function is symmetric, i.e., $r_n(-k) = r_n(k)$, the LPC equations can be expressed as

\[
\sum_{k=1}^{p} r_n(|i-k|)\, a_k = r_n(i), \qquad 1 \le i \le p \tag{2.1.19}
\]

and can be expressed in matrix form as

\[
\begin{bmatrix}
r_n(0) & r_n(1) & \cdots & r_n(p-1) \\
r_n(1) & r_n(0) & \cdots & r_n(p-2) \\
\vdots & \vdots & \ddots & \vdots \\
r_n(p-1) & r_n(p-2) & \cdots & r_n(0)
\end{bmatrix}
\begin{bmatrix}
a_1 \\ a_2 \\ \vdots \\ a_p
\end{bmatrix}
=
\begin{bmatrix}
r_n(1) \\ r_n(2) \\ \vdots \\ r_n(p)
\end{bmatrix}
\tag{2.1.20}
\]

The $p \times p$ matrix of autocorrelation values is a Toeplitz matrix, and hence the system can be solved efficiently through well-known procedures. One commonly used procedure is the Levinson-Durbin algorithm, which is explained in the next section.


2.1.4 LPC processor for speech recognition

Now we describe the details of the LPC front end processor that has been widely used in speech recognition systems. Figure 2.3 shows a block diagram of the LPC processor. The basic steps in the processing include the following.

2.1.4.1 Preemphasis

The digitized speech signal $s(n)$ is put through a low order digital system to spectrally flatten the signal. The digital system used is either fixed or slowly adaptive. Perhaps the most widely used preemphasis network is the fixed first order system

\[
H(z) = 1 - a z^{-1}, \qquad 0.9 \le a \le 1.0 \tag{2.1.21}
\]

In this case the output of the preemphasis network, $\tilde{s}(n)$, is related to the input to the network, $s(n)$, by the difference equation

\[
\tilde{s}(n) = s(n) - a\, s(n-1) \tag{2.1.22}
\]

Each frame $x_l(n)$ of $N$ samples is then multiplied by a window $w(n)$, giving

\[
\tilde{x}_l(n) = x_l(n)\, w(n), \qquad 0 \le n \le N-1 \tag{2.1.24}
\]

A typical window used for the autocorrelation method of LPC is the Hamming window, which has the form

\[
w(n) = 0.54 - 0.46 \cos\!\left( \frac{2 \pi n}{N-1} \right), \qquad 0 \le n \le N-1 \tag{2.1.25}
\]
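The preemphasis and windowing steps are simple enough to sketch directly. The code below assumes the first order network $H(z) = 1 - a z^{-1}$, whose difference equation is $\tilde{s}(n) = s(n) - a\,s(n-1)$, and the Hamming window of Eq. (2.1.25). The function names, the default $a = 0.95$, and the convention $s(-1) = 0$ are assumptions made for this illustration.

```python
import math

def preemphasize(s, a=0.95):
    """First order preemphasis y(n) = s(n) - a*s(n-1), taking s(-1) = 0."""
    return [s[n] - (a * s[n - 1] if n > 0 else 0.0) for n in range(len(s))]

def hamming(N):
    """Hamming window w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)), 0 <= n <= N-1."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (N - 1)) for n in range(N)]

def window_frame(frame):
    """Multiply one frame by a Hamming window of the same length."""
    w = hamming(len(frame))
    return [x * wn for x, wn in zip(frame, w)]

emphasized = preemphasize([1.0, 1.0, 1.0, 1.0], a=0.5)
w = hamming(5)
windowed = window_frame([1.0] * 5)
```

A constant input is attenuated by the preemphasis filter after the first sample, which reflects its high-pass character; the window tapers from 0.08 at the frame edges to 1 at the centre.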


2.1.4.4 Autocorrelation analysis

Each frame of windowed signal is next autocorrelated to give

\[
r_l(m) = \sum_{n=0}^{N-1-m} \tilde{x}_l(n)\, \tilde{x}_l(n+m), \qquad m = 0, 1, \ldots, p \tag{2.1.26}
\]

where the highest autocorrelation lag, $p$, is the order of the LPC analysis. Typically, values of $p$ from 8 to 16 are used in speech recognition systems.
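A direct implementation of the autocorrelation sum in Eq. (2.1.26) is shown below as a minimal sketch; the function name `autocorrelate` and the three-sample test frame are illustrative assumptions.

```python
def autocorrelate(x, p):
    """Return [r(0), r(1), ..., r(p)] with r(m) = sum_{n=0}^{N-1-m} x(n)*x(n+m)."""
    N = len(x)
    return [sum(x[n] * x[n + m] for n in range(N - m)) for m in range(p + 1)]

r = autocorrelate([1.0, 2.0, 3.0], 2)
```

For the frame (1, 2, 3) this gives r(0) = 14, r(1) = 8 and r(2) = 3. The zeroth value r(0) is the frame energy.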


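2.1.4.5 LPC Analysis

The next processing step is the LPC analysis, which converts each frame of $p+1$ autocorrelation coefficients into LPC coefficients. The formal method for converting from autocorrelation coefficients to LPC coefficients is known as Durbin's method and can formally be given as the following algorithm (the frame index $l$ on $r_l(m)$ is dropped for convenience):

\[
\begin{aligned}
E^{(0)} &= r(0) \\
k_i &= \frac{r(i) - \sum_{j=1}^{i-1} \alpha_j^{(i-1)}\, r(|i-j|)}{E^{(i-1)}}, \qquad 1 \le i \le p \\
\alpha_i^{(i)} &= k_i \\
\alpha_j^{(i)} &= \alpha_j^{(i-1)} - k_i\, \alpha_{i-j}^{(i-1)}, \qquad 1 \le j \le i-1 \\
E^{(i)} &= (1 - k_i^2)\, E^{(i-1)}
\end{aligned}
\tag{2.1.27}
\]

The set of equations (2.1.27) is solved recursively for $i = 1, 2, \ldots, p$, and the final solution is given as

\[
a_m = \text{LPC coefficients} = \alpha_m^{(p)}, \qquad 1 \le m \le p \tag{2.1.28}
\]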
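Durbin's recursion translates almost line by line into code. The following is a minimal sketch of the standard Levinson-Durbin recursion, assuming the autocorrelation values r(0), ..., r(p) are given and r(0) is non-zero; the function name and the test values are illustrative assumptions.

```python
def levinson_durbin(r, p):
    """Solve the autocorrelation normal equations for a_1..a_p.

    r holds r(0)..r(p). Returns (a, err), where a[m-1] is a_m and err is
    the final prediction error energy E^(p).
    """
    alpha = [0.0] * (p + 1)   # alpha[j] is the j-th coefficient; index 0 unused
    err = r[0]                # E^(0)
    for i in range(1, p + 1):
        acc = r[i] - sum(alpha[j] * r[i - j] for j in range(1, i))
        k = acc / err         # reflection coefficient k_i
        updated = alpha[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = alpha[j] - k * alpha[i - j]
        alpha = updated
        err *= (1.0 - k * k)  # E^(i) = (1 - k_i^2) E^(i-1)
    return alpha[1:], err

# Autocorrelation of a first-order process with r(m) = 0.5**m.
a, err = levinson_durbin([1.0, 0.5, 0.25], 2)
```

For this input the recursion returns a_1 = 0.5 and a_2 = 0, as expected for a first-order process, with final error energy 0.75.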

2.1.4.6 LPC parameter conversion to cepstral coefficients

A very important LPC parameter set, which can be derived directly from the LPC coefficient set, is the set of LPC cepstral coefficients $c(m)$. The recursion used is

\[
\begin{aligned}
c_0 &= \ln \sigma^2 \\
c_m &= a_m + \sum_{k=1}^{m-1} \left( \frac{k}{m} \right) c_k\, a_{m-k}, \qquad 1 \le m \le p \\
c_m &= \sum_{k=m-p}^{m-1} \left( \frac{k}{m} \right) c_k\, a_{m-k}, \qquad m > p
\end{aligned}
\tag{2.1.29}
\]

where $\sigma^2$ is the gain term in the LPC model. The cepstral coefficients have been shown to be a more robust and reliable feature set for speech recognition than the LPC coefficients [1].
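The cepstral recursion can be sketched as follows. The function name `lpc_to_cepstrum` and the test values are assumptions for illustration; for $m > p$ the sum runs only over the terms in which $a_{m-k}$ is defined, i.e. $k \ge m - p$.

```python
import math

def lpc_to_cepstrum(a, gain_sq, n_ceps):
    """Convert LPC coefficients a_1..a_p into cepstral coefficients c_0..c_n_ceps.

    a[m-1] holds a_m and gain_sq is the gain term sigma^2 of the LPC model.
    """
    p = len(a)
    c = [math.log(gain_sq)] + [0.0] * n_ceps
    for m in range(1, n_ceps + 1):
        acc = a[m - 1] if m <= p else 0.0
        for k in range(max(1, m - p), m):
            acc += (k / m) * c[k] * a[m - k - 1]
        c[m] = acc
    return c

# Single-pole model 1 / (1 - 0.5 z^-1): the cepstrum is known to be 0.5**m / m.
c = lpc_to_cepstrum([0.5], 1.0, 4)
```

The single-pole case is a convenient check because its cepstrum has the closed form $c_m = 0.5^m / m$. Note that more cepstral coefficients than LPC coefficients can be produced, which is how 9 LPC coefficients can yield 12 cepstral coefficients.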


2.2 BANK OF FILTERS FRONT END PROCESSOR

In this section we first describe how this model reduces the data rate by decomposing the signal into different frequency bands. We also describe the different types of filter banks that can be used and the various signal processing operations carried out in the actual filter bank based front end processor.


2.2.1 Filter bank analysis equations


A block diagram of the canonic structure of a complete filter bank is given in Figure 2.4.

Figure 2.4 Canonic structure of a filter bank


The sampled speech signal $s(n)$ is passed through a bank of $Q$ bandpass filters, giving the signals

\[
s_i(n) = s(n) * h_i(n), \qquad 1 \le i \le Q \tag{2.2.1a}
\]

\[
s_i(n) = \sum_{m=0}^{M_i - 1} h_i(m)\, s(n-m) \tag{2.2.1b}
\]

where we have assumed that the impulse response of the $i$th bandpass filter is $h_i(m)$ with a duration of $M_i$ samples; hence we use the convolution representation of the filtering operation to give an explicit expression for $s_i(n)$, the bandpass filtered speech signal. Since the purpose of the filter bank analyzer is to give a measurement of the energy of the speech signal in a given frequency band, each of the bandpass signals is passed through a non-linearity, such as a full wave or half wave rectifier. The non-linearity shifts the bandpass signal spectrum to the low frequency band and also creates high frequency images. A lowpass filter is used to eliminate the high frequency images, giving a set of signals $t_i(n)$, $1 \le i \le Q$, which represent an estimate of the speech signal energy in each of the $Q$ frequency bands. The final two blocks of the bank of filters model are a sampling rate reduction block, in which the lowpass filtered signals $t_i(n)$ are resampled at a rate on the order of 40-60 Hz, and an amplitude compression block, in which the signal dynamic range is compressed (e.g. by logarithmic encoding).

Now consider the design of a $Q = 16$ channel filter bank for a wideband speech signal, where the highest frequency of interest is 8 kHz. Assume we use a sampling rate of $F_s = 20$ kHz on the speech data to minimize the effects of aliasing in analog to digital conversion. The information rate of the raw speech signal is on the order of 240 kbits per second (20 k samples per second times 12 bits per sample). At the output of the analyzer, if we use a sampling rate of 50 Hz and a 7 bit logarithmic amplitude compressor, we get an information rate of 16 channels times 50 samples per second per channel times 7 bits per sample, or 5600 bits per second. Thus for this simple example we have achieved about a 40 to 1 reduction in bit rate, and hopefully such a data reduction results in an improved representation of the significant information in the speech signal.
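The arithmetic of this example can be checked in a few lines. All numbers are taken from the paragraph above; only the variable names are introduced here.

```python
# Raw speech: 20,000 samples per second at 12 bits per sample.
raw_rate = 20_000 * 12           # bits per second

# Analyzer output: 16 channels, 50 samples per second per channel, 7 bits each.
output_rate = 16 * 50 * 7        # bits per second

reduction = raw_rate / output_rate
```

The exact ratio is 240000 / 5600, or roughly 42.9, which the text rounds to "about 40 to 1".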
2.2.2 Types of filter bank used for speech recognition


The most common type of filter bank used for speech recognition is the uniform filter bank, for which the center frequency $f_i$ of the $i$th bandpass filter is defined as

\[
f_i = \frac{F_s}{N}\, i, \qquad 1 \le i \le Q \tag{2.2.2}
\]

where $F_s$ is the sampling rate of the speech signal and $N$ is the number of uniformly spaced filters required to span the frequency range of speech. The actual number of filters used in the filter bank, $Q$, satisfies the relation

\[
Q \le N/2 \tag{2.2.3}
\]

with equality when the entire frequency range of the speech signal is used in the analysis. The bandwidth $b_i$ of the $i$th filter generally satisfies the property

\[
b_i \ge \frac{F_s}{N} \tag{2.2.4}
\]

with equality meaning that there is no frequency overlap, and with inequality meaning that adjacent filter channels overlap. (If $b_i < F_s/N$, then certain portions of the speech spectrum would be missing from the analysis and the resulting speech spectrum would not be considered very meaningful.) Figure 2.5 shows a set of $Q$ realistic bandpass filters covering the range from $\frac{F_s}{N}\left(\frac{1}{2}\right)$ to $\frac{F_s}{N}\left(Q + \frac{1}{2}\right)$.

Figure 2.5 Frequency response of a uniform filter bank
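The center frequencies of a uniform filter bank follow directly from $f_i = (F_s/N)\,i$ together with the constraint $Q \le N/2$. The sketch below is illustrative; the function name and the values $F_s = 20$ kHz, $N = 32$, $Q = 16$ are assumptions chosen for the example, not parameters used in this report.

```python
def uniform_centers(fs, N, Q):
    """Center frequencies f_i = (fs/N)*i for i = 1..Q, requiring Q <= N/2."""
    if 2 * Q > N:
        raise ValueError("Q must not exceed N/2")
    return [fs / N * i for i in range(1, Q + 1)]

centers = uniform_centers(20000.0, 32, 16)
min_bandwidth = 20000.0 / 32   # b_i >= fs/N for gap-free coverage
```

With these values the filters are spaced 625 Hz apart and the last center frequency falls at 10 kHz, i.e. at half the sampling rate, which corresponds to the equality case $Q = N/2$.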
The alternative to the uniform filterbank is a non-uniform filterbank designed according to some criterion for how the individual filters should be spaced in frequency. One commonly used criterion is to space the filters uniformly along a logarithmic frequency scale, which is often justified from a human auditory perception point of view. Thus, for a set of $Q$ bandpass filters with center frequencies $f_i$ and bandwidths $b_i$, $1 \le i \le Q$, we set

\[
b_1 = C \tag{2.2.5a}
\]

\[
b_i = \alpha\, b_{i-1}, \qquad 2 \le i \le Q \tag{2.2.5b}
\]

\[
f_i = f_1 + \sum_{j=1}^{i-1} b_j + \frac{b_i - b_1}{2} \tag{2.2.5c}
\]

where $C$ and $f_1$ are the arbitrary bandwidth and center frequency of the first filter, and $\alpha$ is the logarithmic growth factor. Figure 2.6 shows the set of realistic filters, labelled $i = 1, 2, \ldots, Q$, for this filterbank.

Figure 2.6 Frequency response of a non-uniform filter bank
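The recursion for a logarithmically spaced filterbank, with bandwidths $b_1 = C$, $b_i = \alpha b_{i-1}$ and center frequencies $f_i = f_1 + \sum_{j=1}^{i-1} b_j + (b_i - b_1)/2$, can be written out as a short sketch. The function name and the values $f_1 = 100$ Hz, $C = 100$ Hz, $\alpha = 2$ are hypothetical and chosen only to make the numbers easy to follow.

```python
def log_spaced_bank(f1, C, alpha, Q):
    """Return (centers, bandwidths) of a filterbank with bandwidth growth factor alpha."""
    b = [C]
    for _ in range(1, Q):
        b.append(alpha * b[-1])
    # Zero-based index i is filter i+1, so sum(b[:i]) covers b_1 .. b_i.
    f = [f1 + sum(b[:i]) + (b[i] - b[0]) / 2.0 for i in range(Q)]
    return f, b

centers, bandwidths = log_spaced_bank(100.0, 100.0, 2.0, 3)
```

With these values the three filters cover 50-150 Hz, 150-350 Hz and 350-750 Hz without gaps, so the center frequencies are 100, 250 and 550 Hz.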


An alternative criterion for designing a non-uniform filterbank is to use the critical band scale directly. The spacing of the filters along the critical band scale is based on perceptual studies and is intended to choose bands that give equal contribution to speech articulation. The general shape of the critical band scale is given in Figure 2.7. The scale is close to linear for frequencies below about 1000 Hz (i.e. the bandwidth is essentially constant as a function of $f$) and is close to logarithmic for frequencies above 1000 Hz (i.e. the bandwidth is essentially exponential as a function of $f$).

Figure 2.7 Critical band scale (horizontal axis: center frequency, with 1000 Hz marked)


A slight variation of this critical band scale, the mel scale, is the most widely used and studied filterbank method. For the remainder of this chapter we concentrate on the mel scale based filterbank model.


2.2.3 Mel scale based filterbank model for speech recognition

Figure 2.8 shows a block diagram of this processor.

Figure 2.8 Block diagram of a mel scale processor for speech recognition


The signal processing operations involved in this processor are the following.

2.2.3.1 Preemphasis, frame blocking and windowing

The first three operations are essentially the same as in the LPC processor and have been explained in detail in the LPC section earlier.

2.2.3.2 Calculation of energy vector

Each windowed waveform segment is transformed into the frequency domain by computing the FFT of the corresponding waveform. A vector of log energies is then computed from each waveform segment by weighting the FFT coefficients by the magnitude frequency response of the filterbank. The log energies are taken for the purpose of dynamic range compression and also in order to make the statistics of the estimated speech power spectrum approximately Gaussian.

2.2.3.3 Computing DCT

The final processing stage is to apply the discrete cosine transform (DCT) to the log energy coefficients. This has the effect of compressing the spectral information into the lower order coefficients, and it also decorrelates them to allow the subsequent statistical modelling to use diagonal covariance matrices.
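The DCT step can be illustrated with a direct, unoptimized implementation. The text does not state which DCT variant is used; the sketch below assumes the unnormalized DCT-II, which is a common choice for mel cepstra. The function name and the four-element test vector are illustrative assumptions.

```python
import math

def dct_ii(x, n_out):
    """Unnormalized DCT-II: c(k) = sum_n x(n) * cos(pi*k*(n + 0.5)/N), k = 0..n_out-1."""
    N = len(x)
    return [
        sum(x[n] * math.cos(math.pi * k * (n + 0.5) / N) for n in range(N))
        for k in range(n_out)
    ]

# A flat vector of log energies has all of its content in the zeroth coefficient.
flat = dct_ii([1.0, 1.0, 1.0, 1.0], 3)
```

A perfectly flat log spectrum maps to a single non-zero coefficient, which illustrates how the DCT compacts smooth spectral shapes into the low order terms.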


2.3 THE SCALE TRANSFORM BASED FRONT END PROCESSOR

Now we discuss yet another type of front end processor, which has recently been proposed as an alternative to both mel scale and LPC models [5]. Before that, it is worthwhile to look at some of the important properties of the scale transform.

2.3.1 The scale transform

The scale transform based cepstrum, as mentioned earlier, is motivated by speaker normalization techniques. Such normalization techniques are necessary since different speakers have different formant frequencies for the same vowel. A popular procedure for normalization is based on the assumption that the formant values of any given speaker are approximately a multiplicative scale factor times the formant values of any other speaker for a given vowel. In other words, the formant frequencies $F_A$ and $F_B$ of two speakers A and B for any vowel are related by

\[
F_A = \alpha_{AB}\, F_B \tag{2.3.1}
\]

where $\alpha_{AB}$ is the scale factor. The scale transform [9] is a useful tool to apply in these situations because it provides scale invariant analysis.


Now we briefly review some important properties of the scale transform. The scale transform of a function of frequency $X(f)$ is given by

\[
D_X(c) = \int_{0}^{\infty} X(f)\, \frac{e^{-j 2 \pi c \ln f}}{\sqrt{f}}\, df \tag{2.3.2}
\]

and inversely

\[
X(f) = \int_{-\infty}^{\infty} D_X(c)\, \frac{e^{+j 2 \pi c \ln f}}{\sqrt{f}}\, dc, \qquad f > 0 \tag{2.3.3}
\]

Now consider the scale transform of $\sqrt{\alpha}\, X(\alpha f)$, i.e.,

\[
D_X^{\alpha}(c) = \int_{0}^{\infty} \sqrt{\alpha}\, X(\alpha f)\, \frac{e^{-j 2 \pi c \ln f}}{\sqrt{f}}\, df \tag{2.3.4}
\]

Making the substitution of variables $f' = \alpha f$, we have

\[
D_X^{\alpha}(c) = e^{+j 2 \pi c \ln \alpha} \int_{0}^{\infty} X(f')\, \frac{e^{-j 2 \pi c \ln f'}}{\sqrt{f'}}\, df' \tag{2.3.5}
\]

\[
D_X^{\alpha}(c) = e^{+j 2 \pi c \ln \alpha}\, D_X(c) \tag{2.3.6}
\]

Hence the magnitudes of the scale transform of $X(f)$ and of its scaled version are the same, since the scaling constant $\alpha$ is part of the phase expression and does not appear in the magnitude of the scale transform. It was pointed out in [5] that, with respect to speaker normalization, if one were to compute the magnitude of the scale transform of the formant envelope, then all speaker dependent scaling constants, which appear in the phase term, would be removed.
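The scale invariance of the magnitude can be checked numerically. The sketch below approximates the scale transform by a Riemann sum after the substitution $f = e^{u}$, under which the transform becomes an ordinary Fourier integral of $X(e^{u}) e^{u/2}$. The test function, the integration limits, the step size, and the value $\alpha = 1.2$ are all assumptions made for this illustration.

```python
import cmath
import math

def scale_transform(X, c, u_min=-12.0, u_max=12.0, du=0.01):
    """Approximate D_X(c) = integral of X(e^u) e^(u/2) e^(-j 2 pi c u) du."""
    steps = int(round((u_max - u_min) / du))
    total = 0j
    for i in range(steps + 1):
        u = u_min + i * du
        total += X(math.exp(u)) * math.exp(u / 2.0) * cmath.exp(-2j * math.pi * c * u)
    return total * du

def X(f):
    # Hypothetical envelope chosen so that X(e^u) e^(u/2) is a Gaussian in u.
    return math.exp(-(math.log(f) ** 2)) / math.sqrt(f)

alpha = 1.2

def X_scaled(f):
    # Scaled version sqrt(alpha) * X(alpha * f).
    return math.sqrt(alpha) * X(alpha * f)

c = 0.3
d_original = scale_transform(X, c)
d_scaled = scale_transform(X_scaled, c)
phase_factor = cmath.exp(2j * math.pi * c * math.log(alpha))
```

The two transforms agree in magnitude, and `d_scaled` equals `d_original` multiplied by the pure phase factor $e^{+j 2 \pi c \ln \alpha}$, as the derivation predicts.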
The scale transform may also be computed as the Fourier transform of the function $X(e^{f})\, e^{f/2}$, i.e.,

\[
D_X(c) = \int_{-\infty}^{\infty} X(e^{f})\, e^{f/2}\, e^{-j 2 \pi c f}\, df \tag{2.3.7}
\]

It may be noted that, as a result of log warping, i.e. forming $X(e^{f})$, the speaker specific scale constant $\alpha$ appears purely as a translation parameter in the log-warped domain. This may be easily seen by considering

\[
X_l(f) = X(e^{f}) \tag{2.3.8}
\]

\[
X_l^{\alpha}(f) = X(\alpha e^{f}) = X(e^{f + \log \alpha}) = X_l(f + \log \alpha) \tag{2.3.9}
\]

Therefore, if there are two formant envelopes that are related by a pure scaling constant that is independent of frequency but dependent on the pair of speakers, then in the log-warped domain the envelopes are the same except for a translation factor dependent on $\alpha$.


In subsequent work [6] it was pointed out that the scale factor is dependent on frequency. Based on some experimental results, the frequency region of interest is divided into logarithmically equal bands, and it is assumed that in each such frequency band the formant envelopes of any two speakers are scaled versions of each other. In other words, the formant envelopes of two speakers for the same utterance are assumed to be of the form

\[
A(f) = B\!\left(\alpha_{AB}^{(i)} f\right), \qquad f \in i\text{th band} \tag{2.3.10}
\]

and $i = 1, 2, \ldots, N$, where $N$ is the number of bands. Further, rewriting Eq. (2.3.10) as

\[
A(f) = B\!\left(\alpha_{AB}^{\beta_i} f\right), \qquad f \in i\text{th band} \tag{2.3.11}
\]

where

\[
\alpha_{AB}^{(i)} = \alpha_{AB}^{\beta_i} \tag{2.3.12}
\]

Note that $\alpha_{AB}$ is a constant independent of $i$ and is dependent on the pair of speakers, while $\beta_i$ depends only on the $i$th frequency band and is independent of the pair of speakers.

Let $A(f) = B(\alpha_{AB}^{(i)} f)$ be a scaled version of $B(f)$, where $\alpha_{AB}^{(i)}$ is the scaling factor in the $i$th frequency band. If one were to exponentially sample $A(f)$ at $\Delta\nu_i$ spacing in the $i$th frequency band, we have

\[
A\!\left(e^{m \Delta\nu_i + \log(L_i)}\right) = B\!\left(e^{m \Delta\nu_i + \log(\alpha_{AB}^{(i)}) + \log(L_i)}\right), \qquad m = 0, 1, \ldots, M_i - 1 \tag{2.3.13}
\]

where

\[
\Delta\nu_i = \frac{\log(U_i) - \log(L_i)}{M_i}
\]

and $U_i$, $L_i$ and $M_i$ are respectively the upper frequency limit, the lower frequency limit, and the number of equally spaced samples of the $i$th band. Rewriting Eq. (2.3.13) as

\[
A\!\left(e^{m \Delta\nu_i + \log(L_i)}\right) = B\!\left(e^{\left(m + \frac{\log(\alpha_{AB}^{(i)})}{\Delta\nu_i}\right) \Delta\nu_i + \log(L_i)}\right) \tag{2.3.14}
\]

Hence

\[
A[m] = B\!\left[m + \frac{\log(\alpha_{AB}^{(i)})}{\Delta\nu_i}\right] \tag{2.3.15}
\]

and is translated by $\log(\alpha_{AB}^{(i)}) / \Delta\nu_i$ samples.
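The statement that exponential sampling turns a frequency scaling into a shift of the sample index can be verified with a small numerical sketch. The envelope `B`, the band limits 500-2000 Hz, the number of samples, and the shift of 5 samples are all hypothetical values; the scale factor is chosen as $e^{5 \Delta\nu}$ so that the translation is a whole number of samples.

```python
import math

def exp_sample(X, L, U, M):
    """Sample X at f = exp(m*dv + log L), m = 0..M-1, with dv = (log U - log L)/M."""
    dv = (math.log(U) - math.log(L)) / M
    return [X(math.exp(m * dv + math.log(L))) for m in range(M)]

def B(f):
    # Hypothetical smooth envelope with a single bump near 1 kHz.
    return math.exp(-(math.log(f / 1000.0) ** 2) / 0.1)

L_band, U_band, M = 500.0, 2000.0, 64
dv = (math.log(U_band) - math.log(L_band)) / M
shift = 5
alpha = math.exp(shift * dv)   # scale factor giving a translation of exactly 5 samples

def A(f):
    # Envelope of the second speaker: a frequency-scaled copy of B.
    return B(alpha * f)

a_samples = exp_sample(A, L_band, U_band, M)
b_samples = exp_sample(B, L_band, U_band, M)
max_diff = max(abs(a_samples[m] - b_samples[m + shift]) for m in range(M - shift))
```

Here `a_samples[m]` matches `b_samples[m + 5]` up to rounding error, i.e. the translation is $\log(\alpha)/\Delta\nu = 5$ samples.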


From Eq. (2.3.12) we have

\[
\frac{\log(\alpha_{AB}^{(i)})}{\Delta\nu_i} = \frac{\beta_i \log(\alpha_{AB})}{\Delta\nu_i} \tag{2.3.16}
\]

If

\[
\frac{\Delta\nu_i}{\beta_i} = \frac{\Delta\nu_{i+1}}{\beta_{i+1}} = \Delta\nu \tag{2.3.17}
\]

then we have an equal translation of $\log(\alpha_{AB}) / \Delta\nu$ in all frequency bands. Recalling that we have chosen frequency bands that are equally spaced on a logarithmic scale, we have

\[
\log\!\left(\frac{U_i}{L_i}\right) = \log\!\left(\frac{U_{i+1}}{L_{i+1}}\right) \tag{2.3.18}
\]

Hence Eq. (2.3.17) will be satisfied if

\[
M_{i+1} = \left(\frac{\beta_i}{\beta_{i+1}}\right) M_i \tag{2.3.19}
\]

Therefore we use different sampling rates in different frequency bands to achieve warping, and the resulting sequences are translated versions of each other. For details on the computation of $\beta_i$ refer to [6].


We have not gone into greater detail here because our goal was to make the reader familiar with the properties of the scale transform. The interested reader is urged to pursue this fascinating area further by studying the material in [5, 6, 7] of the Bibliography at the end of this report.


2.3.2 Scale transform based front end processor

After this brief introduction to the scale transform, we now describe the steps involved in the scale transform based front end processor.

Figure 2.9 Block diagram of a scale cepstrum based front end processor


2.3.2.1 Preemphasis and frame blocking

The first two operations are essentially the same as in the LPC processor and have been explained in detail in the LPC section earlier.

2.3.2.2 Estimation of the formant envelope

In the spectral domain the speech signal corresponds to the product of the spectrum of the vocal tract filter and the spectrum of the pitch excitation, i.e.,

\[
V(f) = F(f)\, P(f) \tag{2.3.20}
\]

where $V(f)$, $F(f)$ and $P(f)$ are the spectra of the voiced utterance, the vocal tract, and the pitch excitation respectively. Since we are interested only in the vocal tract response, we would like to remove the effects of pitch excitation. The following procedure, motivated by Nelson's work [17], is used to suppress the effects of pitch. Each frame of speech is segmented into $Q$ overlapping subframes and each subframe is Hanning windowed. We estimate the sample autocorrelation for each subframe and average over the available $Q$ subframes. This averaged autocorrelation function is then Hanning windowed and is used to compute the scale cepstrum described in the next section. We denote the windowed average autocorrelation estimate as $\hat{s}[l]$.
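The subframe averaging procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in this report: the subframe length, the hop size, the maximum lag, the biased autocorrelation estimate, and the use of the decaying half of a Hanning window as the lag window are all hypothetical choices.

```python
import math

def hann(N):
    """Hanning window of length N."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (N - 1)) for n in range(N)]

def autocorr(x, max_lag):
    """Biased sample autocorrelation for lags 0..max_lag."""
    N = len(x)
    return [sum(x[n] * x[n + l] for n in range(N - l)) / N for l in range(max_lag + 1)]

def averaged_autocorr(frame, sub_len, hop, max_lag):
    """Average the autocorrelation of Hanning-windowed overlapping subframes,
    then taper the result with a Hanning lag window."""
    w = hann(sub_len)
    acc = [0.0] * (max_lag + 1)
    count = 0
    for start in range(0, len(frame) - sub_len + 1, hop):
        sub = [frame[start + n] * w[n] for n in range(sub_len)]
        acc = [a + r for a, r in zip(acc, autocorr(sub, max_lag))]
        count += 1
    average = [a / count for a in acc]
    # Decaying half of a Hanning window: weight 1 at lag 0, weight 0 at max_lag.
    lag_window = hann(2 * max_lag + 1)[max_lag:]
    return [a * lw for a, lw in zip(average, lag_window)], count

frame = [math.sin(2.0 * math.pi * 0.05 * n) for n in range(512)]
s_hat, n_subframes = averaged_autocorr(frame, sub_len=128, hop=64, max_lag=32)
```

Averaging over subframes smooths the autocorrelation estimate, and the lag window suppresses the longer lags, which carry most of the pitch periodicity.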


2.3.2.3 Computing the scale cepstrum

The scale cepstrum is obtained by computing the scale transform of $\log|S(f)|$ and is denoted by $D(c)$, where $S(f)$ is the Fourier transform of $\hat{s}[l]$, the windowed averaged autocorrelation estimate. For computing the scale cepstrum we warp the frequency axis to a logarithmic scale, i.e. form $\log|S(e^{f})|$ from $\log|S(f)|$, multiply with the preemphasis vector, and compute the Fourier transform to get $D(c)$. The magnitude of the scale cepstrum, $|D(c)|$, is then used as a feature vector for speaker independent recognition systems.

3 VOWEL CLASSIFICATION


In this chapter we compare the three front end processors by evaluating their performance in vowel classification. We also describe the procedure for the selection of data, the actual values of the parameters used in the front end processors, and the methods for the comparison of the features. Finally, the results obtained from the classification experiments are tabulated.

3.1 DATABASES

Four different databases, RS1, RS2, TS1 and TS2, are used to test the performance of the recognizer based on the three front end processors. The data for each database is selected from dialect region 7 (further divided into dr7train and dr7test) of the TIMIT database. The databases consist of utterances of 10 vowels, /aa/, /ae/, /ao/, /ax/, /eh/, /er/, /ey/, /ih/, /iy/ and /ow/, by male and female speakers. Each utterance is chosen so that the corresponding phoneme is relatively stationary over at least 768 samples, and the middle 512 samples are used in the computation of the features. The noisy utterance is simulated by adding artificially generated white Gaussian noise. In our experiments we used clean speech and noisy speech at 15 dB SNR.
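Adding white Gaussian noise at a prescribed SNR can be sketched as below. How the noise was scaled in the actual experiments is not stated; this sketch assumes the noise is scaled so that the ratio of measured signal power to measured noise power over the segment equals the target SNR. The function name, the seed, and the sinusoidal test signal are illustrative assumptions.

```python
import math
import random

def add_white_noise(signal, snr_db, seed=0):
    """Add white Gaussian noise scaled to give the requested SNR in dB.

    Returns (noisy_signal, noise). The SNR is defined here from the measured
    powers of the given segment (an assumption of this sketch).
    """
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    p_signal = sum(x * x for x in signal) / len(signal)
    p_noise = sum(v * v for v in noise) / len(noise)
    scale = math.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    scaled = [scale * v for v in noise]
    return [x + v for x, v in zip(signal, scaled)], scaled

clean = [math.sin(2.0 * math.pi * 0.05 * n) for n in range(512)]
noisy, noise = add_white_noise(clean, 15.0)
measured_snr = 10.0 * math.log10(
    sum(x * x for x in clean) / sum(v * v for v in noise)
)
```

Because the noise is rescaled from its measured power, the realised SNR of the 512-sample segment matches the 15 dB target up to rounding error.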

1. RS1: The data for this database was selected from dr7train and consists of all possible utterances of the vowels /aa/, /ae/, /ao/, /ax/, /eh/, /er/, /ey/, /ih/, /iy/ and /ow/ by 20 male speakers and 18 female speakers.

2. RS2: The data for this database was also selected from dr7train and consists of 50 utterances of each vowel by male speakers and 50 utterances by female speakers. This is done in order to give equal weight to all vowels.

3. TS1: For this database we selected the data from dr7test of dialect region 7 and included all possible utterances of the ten vowels by 15 male speakers and 8 female speakers.

4. TS2: This database is the same as the TS1 database, except that 40 utterances by male speakers and 35 utterances by female speakers were taken for each vowel.


3.2 THE VOWEL RECOGNITION SYSTEM


As shown in Figure 3.1, there are three considerations in implementing a vowel
recognition system, namely generation of the features, creation of the reference patterns,
and selection of the pattern similarity measure.



Figure 3.1 Block diagram of a vowel classification system


3.2.1 Generation of the features

The detailed description of the various front end processors is given in Chapter 2. Here we
give the values of the various parameters used in the processors and describe the
procedure for generation of the feature vectors.

For obtaining the LPC features, the speech waveform is preemphasized by a first order
FIR digital filter with a = 0.91. We took the frame size as 512, which gives one frame per
utterance. The frame is weighted with a Hamming window and p + 1 autocorrelation values
are computed. These p + 1 autocorrelation values are then converted into p LPC
coefficients, where p = 9. These 9 LPC coefficients are converted into 12 LPC cepstral
coefficients, which are used as the feature vector for vowel recognition.
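
The LPC chain above (preemphasis, windowing, autocorrelation, LPC coefficients, LPC cepstrum) can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the implementation used here: the preemphasis filter is assumed to be y[n] = x[n] - a x[n-1], the autocorrelation values are converted with the standard Levinson-Durbin recursion, and the cepstrum is obtained with the usual LPC-to-cepstrum recursion for an all-pole model. The function name `lpc_cepstrum` is hypothetical.

```python
import numpy as np

def lpc_cepstrum(frame, order=9, n_cep=12, preemph=0.91):
    """LPC cepstral coefficients c[1..n_cep] of one frame (illustrative sketch)."""
    x = np.asarray(frame, dtype=float)
    # Assumed preemphasis filter: y[n] = x[n] - a * x[n-1]
    y = np.append(x[0], x[1:] - preemph * x[:-1])
    y = y * np.hamming(len(y))
    # p + 1 autocorrelation values r[0..p]
    r = np.array([np.dot(y[:len(y) - k], y[k:]) for k in range(order + 1)])
    # Levinson-Durbin recursion: predictor coefficients a[1..p]
    a = np.zeros(order + 1)
    err = r[0]
    for i in range(1, order + 1):
        refl = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err
        a_prev = a.copy()
        a[i] = refl
        a[1:i] = a_prev[1:i] - refl * a_prev[i - 1:0:-1]
        err *= 1.0 - refl * refl
    # LPC-to-cepstrum recursion (coefficients beyond order p use only the sum)
    c = np.zeros(n_cep + 1)
    for m in range(1, n_cep + 1):
        acc = sum((k / m) * c[k] * a[m - k] for k in range(max(1, m - order), m))
        c[m] = acc + (a[m] if m <= order else 0.0)
    return c[1:]
```

With the default arguments, a 512 sample frame yields the 12 element feature vector described above.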

For implementing the mel cepstrum, the first three steps, namely preemphasis, frame
blocking and weighting with a Hamming window, are the same as explained in the previous
paragraph. We then take the Fourier transform of the corresponding waveform to obtain
the FFT coefficients. A vector of energies is computed by weighting the FFT coefficients
with the magnitude response of a filter bank made up of 40 triangular filters. The center
frequencies of the first 13, linearly spaced, filters are 66.67 Hz apart, starting at 133.34 Hz.
The center frequencies of the other 27 filters are chosen to have a ratio of 1.0711703
between successive filters. The mel cepstrum coefficients are then obtained by computing
the discrete cosine transform of the vector of log energies.
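
The filter bank layout above can be made concrete with a short sketch that lists the 40 center frequencies. One assumption is made that the text does not state explicitly: the logarithmically spaced filters are taken to continue from the last linearly spaced center frequency. The function name `mel_center_frequencies` is hypothetical.

```python
import numpy as np

def mel_center_frequencies(n_linear=13, n_log=27, f_start=133.34,
                           linear_step=66.67, log_ratio=1.0711703):
    """Center frequencies (Hz) of the triangular filters (illustrative sketch)."""
    # 13 linearly spaced centers, 66.67 Hz apart, starting at 133.34 Hz
    linear = f_start + linear_step * np.arange(n_linear)
    # Assumption: log spacing continues from the last linear center
    log = linear[-1] * log_ratio ** np.arange(1, n_log + 1)
    return np.concatenate([linear, log])
```

Under this assumption the highest center frequency stays below the 8 kHz Nyquist frequency of the 16 kHz TIMIT data.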

For obtaining the scale cepstrum, the first step is preemphasis and frame blocking
as usual. The next step is to obtain the smoothed formant envelope, which is obtained
using the method described in the previous chapter. We have chosen the subframes to be
96 samples long, and the overlap between the subframes is 64 samples, resulting in 13
subframes. Since the sampling frequency of the TIMIT database is 16 kHz, we assume the
signal is bandlimited between 100 Hz and 7000 Hz. The scale cepstrum for log warping
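may therefore be represented as

$$D(c) = \int_{\ln(100)}^{\ln(7000)} \log\left|S(e^{\nu})\right| e^{\nu/2}\, e^{-j2\pi c\nu}\, d\nu \qquad (3.2.1)$$

which is the conventional Fourier transform of $\log|S(e^{\nu})|\,e^{\nu/2}$. For digital
implementation we sample in the $\nu$ domain and obtain an expression which can be easily
implemented using the Fast Fourier Transform (FFT), i.e. (see [5] for details),

$$D\!\left[\frac{k}{N\nu_s}\right] = \nu_s\, e^{-j2\pi k\ln(100)/(N\nu_s)} \sum_{m=0}^{K-1} \log\left|S\!\left(e^{m\nu_s+\ln(100)}\right)\right| e^{(m\nu_s+\ln(100))/2}\, e^{-j2\pi km/N}, \quad k = 0, 1, \ldots, (N-1) \qquad (3.2.2)$$

where $\nu_s$ is the sampling interval in the $\nu$ domain, set by the span $\ln(7000) - \ln(100)$ of the
warped axis. The phase term $e^{-j2\pi k\ln(100)/(N\nu_s)}$ can be ignored since it does not contribute
to the magnitude of $D[k/(N\nu_s)]$. $S(e^{m\nu_s+\ln(100)})$ can be easily computed from the time lag
samples of the smoothed formant envelope $s[l]$ as

$$S\!\left(e^{m\nu_s+\ln(100)}\right) = \sum_{l} s[l]\, e^{-j2\pi e^{m\nu_s+\ln(100)}\, lT} \qquad (3.2.3)$$

where $T$ is the sampling period in the time lag domain.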


Recall that, for making the scale factor $\alpha$ independent of frequency, we divide the
frequency band between 100 Hz and 7000 Hz into five logarithmically equal bands of
[100, 240) Hz, [240, 550) Hz, [550, 1280) Hz, [1280, 3000) Hz and [3000, 7000) Hz. The
scale cepstrum is then computed using the following values of the parameters: $K = 128$,
$L = 191$, $N = 256$ and $T$ = [value illegible]. We have used two warping functions, namely warp1
and warp2. In warp1 the number of samples in each of the five frequency bands is given
by $M_1 = 9$, $M_2 = 12$, $M_3 = 21$, $M_4 = 35$, $M_5 = 51$, and in warp2 the values are $M_1 = 40$,
$M_2 = 32$, $M_3 = 23$, $M_4 = 18$, $M_5 = 15$. Note that the sampled scale cepstrum $D[\cdot]$ is
interpolated by a factor of two.
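
To make the two warping functions concrete, the sketch below lists the per-band sample counts and generates one possible set of frequency sample points. The placement of the samples inside each band is an assumption made only for illustration (uniform spacing in log frequency within a band); the text specifies only how many samples fall in each band. The names `BAND_EDGES_HZ`, `WARP1`, `WARP2` and `warped_sample_frequencies` are hypothetical.

```python
import numpy as np

BAND_EDGES_HZ = [100.0, 240.0, 550.0, 1280.0, 3000.0, 7000.0]
WARP1 = [9, 12, 21, 35, 51]    # samples per band, warp1
WARP2 = [40, 32, 23, 18, 15]   # samples per band, warp2

def warped_sample_frequencies(samples_per_band, edges=BAND_EDGES_HZ):
    """Frequency sample points (Hz) for a piecewise warping (illustrative sketch)."""
    freqs = []
    for lo, hi, m in zip(edges[:-1], edges[1:], samples_per_band):
        # Assumption: m points uniformly spaced in log frequency over [lo, hi)
        step = (np.log(hi) - np.log(lo)) / m
        freqs.append(np.exp(np.log(lo) + step * np.arange(m)))
    return np.concatenate(freqs)
```

Both warping functions place 128 samples over the band in total; they differ only in how densely each band is sampled, with warp1 favoring the high bands and warp2 the low bands.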


In all the cepstra the zeroth coefficient is not used, since it is roughly a measure of
the spectral energy. Coefficients 1 to 12 are used to measure the separability between the
different phoneme classes.

3.2.2 Forming reference patterns


The next step after the generation of the feature vectors for all utterances is to form the
reference patterns for all vowels. Let $F_k^{(i)}$ denote the feature vector for the $k$th utterance and
$N_i$ denote the total number of such utterances from the $i$th phoneme class. Then the
reference pattern $M^{(i)}$ for the $i$th phoneme class, which is the mean feature vector, is
given by

$$M^{(i)} = \frac{1}{N_i}\sum_{k=1}^{N_i} F_k^{(i)}, \quad i = 1, 2, \ldots, I \qquad (3.2.4)$$

where $I$ is the number of phoneme classes being considered. In our case, since we are
considering 10 vowels, $I = 10$.
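
The mean reference pattern for each class can be computed in a few lines. The sketch assumes the feature vectors are stacked row-wise in an array and that class labels are integers 0 to I-1; the function name `reference_patterns` is hypothetical.

```python
import numpy as np

def reference_patterns(features, labels, n_classes):
    """Mean feature vector of each phoneme class, one row per class."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    return np.stack([features[labels == i].mean(axis=0)
                     for i in range(n_classes)])
```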

3.2.3 Comparing the patterns

The purpose of this block is to measure the dissimilarity, or distance, between two feature
vectors of the speech. A test feature vector is compared with each reference feature vector
and a distance score is produced. In our experiments we used three distance measures:

1. Euclidean distance ($d_e$): The Euclidean distance between the test vector $F$ and
the reference pattern $M^{(i)}$ for the $i$th phoneme class is given by

$$d_e^{(i)} = \sum_{l=1}^{L}\left(F_l - M_l^{(i)}\right)^2 \qquad (3.2.5)$$

where $F_l$ represents the $l$th coefficient of the feature vector $F$ and $M_l^{(i)}$ represents the
$l$th coefficient in the reference feature vector for the $i$th phoneme class. $L$ is the number
of coefficients used. For our case $L = 12$.

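2. Weighted Euclidean distance ($d_w$): For a test vector $F$ and the reference vector $M^{(i)}$
this distance is given by

$$d_w^{(i)} = \sum_{l=1}^{L}\frac{\left(F_l - M_l^{(i)}\right)^2}{V_l} \qquad (3.2.6)$$

$V_l$ is the $l$th coefficient in the mean variance vector $V$ and is given by

$$V_l = \frac{1}{I}\sum_{i=1}^{I} V_l^{(i)} \qquad (3.2.7)$$

where

$$V_l^{(i)} = \frac{1}{N_i}\sum_{k=1}^{N_i}\left(F_{k,l}^{(i)} - M_l^{(i)}\right)^2 \qquad (3.2.8)$$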


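3. Mahalanobis distance using the covariance matrix ($d_{wc}$): Using this as a measure of
dissimilarity between the two patterns, the distance between the test vector $F$ and the
reference pattern for the $i$th phoneme class is given by

$$d_{wc}^{(i)} = \left(F - M^{(i)}\right)^{T} C^{-1} \left(F - M^{(i)}\right) \qquad (3.2.9)$$

where

$$C = \frac{1}{I}\sum_{i=1}^{I}\left[c_{xy}\right]^{(i)}, \quad x = 1, 2, \ldots, L, \quad y = 1, 2, \ldots, L \qquad (3.2.10)$$

and

$$\left[c_{xy}\right]^{(i)} = \frac{1}{N_i}\sum_{k=1}^{N_i}\left(F_{k,x}^{(i)} - M_x^{(i)}\right)\left(F_{k,y}^{(i)} - M_y^{(i)}\right) \qquad (3.2.11)$$

$F_{k,x}^{(i)}$ is the $x$th coefficient in the feature vector for the $k$th utterance of phoneme class $i$.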


3.2.4 Decision rule

The output of the pattern comparison block is a vector $d = [d^{(i)}]$, $i = 1, 2, \ldots, I$, for each
utterance. The utterance is recognized as phoneme class $k$ if

$$d^{(k)} = \min_{i}\left(d^{(i)}\right) \qquad (3.2.12)$$
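
The pattern comparison and decision steps can be sketched together as follows. This is an illustration under stated assumptions, not the code used in the experiments: the three distances are taken in squared form (without a square root, which does not change which class is nearest), the variance vector and covariance matrix are averaged over the classes, and class labels are integers starting at 0. The function names `class_statistics` and `classify` are hypothetical.

```python
import numpy as np

def class_statistics(features, labels, n_classes):
    """Class means, class-averaged variance vector and covariance matrix."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    means, variances, covariances = [], [], []
    for i in range(n_classes):
        x = features[labels == i]
        mu = x.mean(axis=0)
        centered = x - mu
        means.append(mu)
        variances.append((centered ** 2).mean(axis=0))
        covariances.append(centered.T @ centered / len(x))
    return (np.stack(means),
            np.mean(variances, axis=0),
            np.mean(covariances, axis=0))

def classify(f, means, variance, covariance, measure="mahalanobis"):
    """Return (index of the nearest class, vector of distances to all classes)."""
    diff = means - np.asarray(f, dtype=float)   # one row per class
    if measure == "euclidean":
        d = np.sum(diff ** 2, axis=1)
    elif measure == "weighted":
        d = np.sum(diff ** 2 / variance, axis=1)
    elif measure == "mahalanobis":
        inv_cov = np.linalg.inv(covariance)
        d = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    else:
        raise ValueError("unknown distance measure: " + measure)
    return int(np.argmin(d)), d
```

The decision rule is the final `argmin`: the test vector is assigned to the class whose reference pattern gives the smallest distance.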
3.3 RESULTS


The results of the recognition tests for databases RS1, RS2, TS1 and TS2 are given in
Tables 3.1 to 3.32. Figures 3.2 and 3.3 show the recognition accuracy of the system as a
function of the distance measure used for the RS1 and TS1 databases, respectively.

Table 3.1 LPC, Euclidean, train RS1, test RS1
Table 3.2 Mel, Euclidean, train RS1, test RS1
Table 3.3 Scale (warp1), Euclidean, train RS1, test RS1


                     TESTING
                  CLEAN    NOISY
TRAINING  CLEAN   3S6]     22.61
          NOISY   19.44    31.08

Table 3.4 Scale (warp2), Euclidean, train RS1, test RS1
Table 3.5 LPC, weighted Euclidean, train RS1, test RS1
Table 3.6 Mel, weighted Euclidean, train RS1, test RS1
Table 3.7 Scale (warp1), weighted Euclidean, train RS1, test RS1
Table 3.8 Scale (warp2), weighted Euclidean, train RS1, test RS1
Table 3.9 LPC, Mahalanobis, train RS1, test RS1
Table 3.10 Mel, Mahalanobis, train RS1, test RS1
Table 3.11 Scale (warp1), Mahalanobis, train RS1, test RS1
Table 3.12 Scale (warp2), Mahalanobis, train RS1, test RS1



                     TESTING
                  CLEAN    NOISY
TRAINING  CLEAN   33.92    7.73
          NOISY   22.73    40.05

Table 3.13 LPC, Euclidean, train RS1, test TS1
Table 3.14 Mel, Euclidean, train RS1, test TS1
Table 3.15 Scale (warp1), Euclidean, train RS1, test TS1
Table 3.16 Scale (warp2), Euclidean, train RS1, test TS1
Table 3.17 LPC, weighted Euclidean, train RS1, test TS1
Table 3.18 Mel, weighted Euclidean, train RS1, test TS1
Table 3.19 Scale (warp1), weighted Euclidean, train RS1, test TS1
Table 3.20 Scale (warp2), weighted Euclidean, train RS1, test TS1
Table 3.21 LPC, Mahalanobis, train RS1, test TS1
Table 3.23 Scale (warp1), Mahalanobis, train RS1, test TS1
Table 3.24 Scale (warp2), Mahalanobis, train RS1, test TS1
Table 3.25 LPC, Mahalanobis, train RS2, test RS2
Table 3.26 Mel, Mahalanobis, train RS2, test RS2
Table 3.27 Scale (warp1), Mahalanobis, train RS2, test RS2
Table 3.28 Scale (warp2), Mahalanobis, train RS2, test RS2
Table 3.29 LPC, Mahalanobis, train RS2, test TS2
Table 3.30 Mel, Mahalanobis, train RS2, test TS2
Table 3.31 Scale (warp1), Mahalanobis, train RS2, test TS2
Table 3.32 Scale (warp2), Mahalanobis, train RS2, test TS2

Figure 3.2 Recognition rate vs distance measure, clean-clean condition, RS1 database

Figure 3.3 Recognition rate vs distance measure, clean-clean condition, TS1 database

Figures 3.4 and 3.5 show the performance of the recognition system as a function of signal
to noise ratio using the different front end processors when the Mahalanobis distance is used as the
similarity measure. The training, or generation of reference patterns, is done on the RS1
database and the testing is done on both the RS1 and TS1 databases.

The results of Tables 3.1 to 3.32 show the following:


1. The recognition accuracy is a strong function of the distance measure used. The
accuracy is highest when the Mahalanobis distance $d_{wc}$ is used, which takes into account the
correlation between the cepstral coefficients. Since this distance gives the best performance
compared to the other two, henceforth in all our experiments we use it as the
measure of dissimilarity.

2. By using $d_{wc}$ as the distance measure, the performance of the scale transform based
front end processor improves significantly compared to the mel scale and LPC front end
processors. This suggests that the coefficients in the scale cepstrum features are much
more correlated than those in the mel scale and LPC features.

3. The recognition rates are higher for databases RS2 and TS2 than for RS1 and
TS1. This is probably because the vowels /iy/, /ih/, /ey/ and /aa/, /ao/ have
very low separability but occur most frequently in speech, whereas the vowels /er/, /ow/, /ax/ have
good separability but occur less frequently in speech. In databases RS2 and TS2 we give
equal weightage to all vowels, the result being higher recognition rates.

4. When the recognition system using LPC features is trained under clean conditions and
tested in noisy environments, its performance drops significantly compared to
processors using mel scale and scale cepstrum features.

5. For vowel recognition, the performance of the scale (warp1) features is between that of
the mel based and LPC based features under noisy conditions. It can be observed from the
tables that the scale based features perform better when warp2 is used for
vowel classification. Under these conditions the vowel classification performance of the scale based
features is comparable to that of the mel based features and better than that of the LPC based features.

4 ISOLATED DIGIT RECOGNITION


In this chapter we compare the three front end processors by evaluating their relative
recognition performance when used in an isolated digit recognizer. The isolated digit
recognizer is based on a vector quantization scheme. The data used in the testing consists
of utterances spoken by both male and female speakers, in order to compare the speaker
independent recognition performance. We also describe the database used, the actual values
of the parameters used in the front end processors, the procedures for the formation of the
reference patterns, and the decision rule used. Finally, the results obtained from the
experiments are tabulated.


4.1 DATABASES

We have formed two digit databases, RS and TS, from the OGI database obtained from the
Oregon Graduate Institute. The speech signal in the OGI database is sampled at 8 kHz,
limiting the bandwidth of the speech signal to 4 kHz. We used the end point detection
algorithm of Rabiner et al. [16] for all utterances and found that in most cases the entire
utterance was included. Hence in all our experiments we have included the entire
utterance in the analysis.

The first database for the digits vocabulary, which we call the RS database, consisted
of a set of 75 speakers (from the OGI database) made up of an almost equal number of male and
female speakers. Each talker spoke each digit once, giving 10 utterances per talker
and a total of 750 utterances for the RS database. The second database, TS, was also
generated by a set of 75 speakers evenly divided between male and female speakers.
These 75 talkers were different from the 75 speakers who generated the RS database.
Every speaker spoke each digit in the vocabulary once, resulting in 750 utterances for the
TS database.


4.2 THE DIGIT RECOGNITION SYSTEM



Figure 4.1 Block diagram of an isolated digit recognizer


Figure 4.1 shows the block diagram of the digit recognition system. The system operates
in three modes, namely training, template creation (or clustering), and testing. In the
training mode the RS database is used to create the feature vectors for each digit, which are then
stored for future use by the clustering package. In the second mode, the template creation
mode, the stored data for each digit in the vocabulary is sent to a clustering algorithm
which divides it into different clusters. The reference patterns are defined to be the center
points of the clusters that are found. The clustering is performed for all digits of the
vocabulary. The third mode of the system is the testing, or usage, mode, in which the
unknown test word that is spoken is analyzed and compared to each of the reference
templates using a distance measure. A vector of distance scores is computed for each
candidate, and finally a decision as to which class it belongs to is taken on the basis of
this score.

4.2.1 Mode 1: The training mode


The procedure for generation of feature vectors with LPC, mel scale and scale transform
is the same as explained previously, except that some values of the parameters are
different since we are now using speech sampled at 8 kHz and bandlimited to 4 kHz.
Rather than explaining the whole procedure we just enumerate the differences. Recall that in
the previous chapter each utterance consisted of one frame of 512 samples. However,
each utterance of a digit consists of several phonemes and lasts for a few hundred milliseconds.
We therefore have to block the utterance into frames and compute the feature vector for
each frame. Hence, corresponding to each utterance of a digit we have a sequence of
frame vectors. In this chapter the speech signal is blocked into frames of 320 samples,
with adjacent frames being separated by 106 samples. These correspond to 40 msec
frames separated by 13.25 msec.
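
The frame blocking above can be sketched as follows. How a trailing partial frame is handled is not stated in the text; the sketch assumes it is discarded. The function name `block_into_frames` is hypothetical.

```python
import numpy as np

def block_into_frames(signal, frame_len=320, frame_shift=106):
    """Split a signal into overlapping frames, one frame per row."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    # Assumption: samples after the last complete frame are discarded
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    starts = frame_shift * np.arange(n_frames)
    return np.stack([signal[s:s + frame_len] for s in starts])
```

At the 8 kHz sampling rate, 320 samples correspond to 320/8000 = 40 msec and a shift of 106 samples to 106/8000 = 13.25 msec, matching the values quoted above.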

1. The LPC parameters remain the same as before.

2. In the mel scale front end processor, firstly, we use frames of 320 samples separated by
106 samples and, secondly, 32 filters are used instead of 40. The first 13 are
linearly spaced as usual and the other 19 are logarithmically spaced. This is done because the
spectrum of the speech signal is limited to 4 kHz.

3. The first difference in the scale transform is the same as above. Secondly, the frequency
region [100, 3900] Hz is divided into five bands of [100, 240) Hz, [240, 550) Hz,
[550, 1280) Hz, [1280, 3000) Hz and [3000, 3900] Hz. The number of samples in each of the
five frequency bands is $M_1 = 9$, $M_2 = 12$, $M_3 = 21$, $M_4 = 35$ and $M_5 = 16$ for warp1, and the
values are $M_1 = 40$, $M_2 = 32$, $M_3 = 23$, $M_4 = 18$ and $M_5 = 15$ for warp2. It is to be noted
that the sampling rate in the fifth frequency band remains the same as before.

4. It has been shown that adding transitional spectral information [14, 15] to the
instantaneous feature vectors improves the recognition rate significantly. Furui [14]
suggested the use of orthogonal polynomials to characterize the time trajectories of
cepstral coefficients over a finite length time window, in order to capture the transitional
spectral information. The linear regression coefficient, namely the first order orthogonal
polynomial coefficient, is

$$a_m(t) = \frac{\sum_{n=-n_0}^{n_0} n\,F_m(t+n)}{\sum_{n=-n_0}^{n_0} n^2} \qquad (4.2.1)$$

where $F_m(n)$ is the $m$th coefficient in the $n$th feature vector. The length of the interval is
set to 7 frames; accordingly $n_0$ is equal to 3. The 118 msec interval seems adequate for
preserving transitional information associated with changes from one phoneme to
another. The utterance at time $t$ is then represented by the cepstrum coefficients $F_m(t)$
and the regression coefficients $a_m(t)$, where $t$ is the frame number. Since the
regression coefficients $a_m(t)$ cannot be defined within the three frame intervals at the
beginning and end of the speech period, these three frame intervals are eliminated from the
speech period. This does not eliminate any of the actual utterance because of the short
silence intervals that are present at the beginning and end of utterances.
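
The regression coefficients can be sketched as below, assuming the cepstral vectors of one utterance are stacked row-wise (one row per frame). The function returns the cepstral and regression coefficients side by side, with the first and last three frames dropped as described above. The function name `regression_coefficients` is hypothetical.

```python
import numpy as np

def regression_coefficients(cepstra, n0=3):
    """Append first order regression coefficients to each cepstral vector."""
    cepstra = np.asarray(cepstra, dtype=float)
    n_frames = cepstra.shape[0]
    if n_frames < 2 * n0 + 1:
        raise ValueError("need at least 2*n0 + 1 frames")
    n = np.arange(-n0, n0 + 1)
    denom = np.sum(n ** 2)
    # Slope of a least squares line fitted over the 2*n0 + 1 frame window
    deltas = np.stack([n @ cepstra[t - n0:t + n0 + 1] / denom
                       for t in range(n0, n_frames - n0)])
    # Frames without a complete window at either end are dropped
    return np.hstack([cepstra[n0:n_frames - n0], deltas])
```

With 12 cepstral coefficients per frame this gives 24 coefficients per feature vector.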


4.2.2 Mode 2: The clustering mode

The basic idea of the clustering is to reduce the information rate of the speech signal to a
low rate through the use of a codebook with a relatively small number of codewords, or
reference patterns. The goal is to be able to represent the spectral information of the
signal in an efficient manner and yet preserve all the characteristics of the signal. The way
in which a set of L training vectors can be clustered into a set of M codebook vectors is
the following (this procedure is known as the generalized Lloyd algorithm or the K-means
clustering algorithm):

1. Initialization: Arbitrarily choose M vectors (initially out of the training set of L
vectors) as the initial set of codewords in the codebook.

2. Nearest neighbour search: For each training vector, find the codeword in the codebook
that is closest and assign that vector to the corresponding cell.

3. Centroid update: Update the codeword in each cell using the centroid of the training
vectors assigned to that cell.

4. Iteration: Repeat steps 2 and 3 until the average distortion falls below a preset
threshold.

Although the above iterative procedure works well, it has been shown that it is
advantageous to design an M vector codebook in stages, i.e. by first designing a 1 vector
codebook, then using a splitting technique on the codewords to initialize the search for a
2 vector codebook, and continuing the splitting process until the desired M vector
codebook is obtained. This procedure is called the binary split algorithm and is formally
implemented by the following procedure:

1. Design a 1 vector codebook; this is the centroid of the entire set of training vectors.

2. Double the size of the codebook by splitting each current codeword $y_n$ according to the
rule

$$y_n^{+} = y_n(1 + \epsilon) \qquad (4.2.2)$$

$$y_n^{-} = y_n(1 - \epsilon) \qquad (4.2.3)$$

where $n$ varies from 1 to the current size of the codebook and $\epsilon$ is the splitting parameter.
We use $\epsilon = 0.01$.

3. Use the K-means iterative algorithm to get the best set of centroids for the split
codebook.

4. Iterate steps 2 and 3 until a codebook of size M is designed.

We tried different values of M and found that a value of M = 16 suits best, both
from the point of view of recognition performance and of computational complexity.
Hence each digit is represented by M such codewords.
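
The binary split procedure can be sketched as follows. Two simplifications are assumptions of this sketch and not taken from the text: the squared Euclidean distance is used for the nearest neighbour search, and the K-means iteration stops when the relative decrease in average distortion becomes small (rather than when the distortion falls below an absolute threshold). The function names `kmeans_refine` and `binary_split_codebook` are hypothetical.

```python
import numpy as np

def kmeans_refine(vectors, codebook, tol=1e-4, max_iter=100):
    """K-means iterations starting from the given codebook."""
    prev = np.inf
    for _ in range(max_iter):
        # Nearest neighbour search (squared Euclidean distance, an assumption)
        dist = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        nearest = dist.argmin(axis=1)
        distortion = dist[np.arange(len(vectors)), nearest].mean()
        # Centroid update; an empty cell keeps its previous codeword
        for j in range(len(codebook)):
            members = vectors[nearest == j]
            if len(members) > 0:
                codebook[j] = members.mean(axis=0)
        # Assumed stopping rule: relative decrease in distortion below tol
        if prev - distortion <= tol * distortion:
            break
        prev = distortion
    return codebook

def binary_split_codebook(vectors, size=16, eps=0.01):
    """Design a codebook by repeated splitting (size should be a power of two)."""
    vectors = np.asarray(vectors, dtype=float)
    codebook = vectors.mean(axis=0, keepdims=True)     # 1 vector codebook
    while len(codebook) < size:
        codebook = np.concatenate([codebook * (1 + eps), codebook * (1 - eps)])
        codebook = kmeans_refine(vectors, codebook)
    return codebook
```

Calling `binary_split_codebook(training_vectors, size=16)` once per digit would give the 16 codewords per digit described above.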
4.2.3 Mode 3: The testing mode


The last mode is the testing, or usage, mode in which the unknown utterance is
compared with each of the codewords in the codebooks for all digits. In the previous
chapter we used three distance measures, and from the results it was seen that using $d_{wc}$
as the distance measure gives the lowest error rates. Hence in all subsequent experiments we
use this as the measure of dissimilarity.

Let the feature vector at time $t = q$ be represented by $F(q)$, $q = 1, 2, \ldots, Q$, where
$Q$ is the total number of frames when the utterance is blocked into frames of 320
samples with a separation of 106 samples. Let the $j$th codeword in the codebook for the
$i$th digit be represented by $F_j^{(i)}$; then the distance between $F(q)$ and $F_j^{(i)}$ is represented
by $d(F(q), F_j^{(i)})$, where the subscript $wc$ has been omitted. For all the codewords
$j = 1, 2, \ldots, M$, where $M$ is the total number of codewords in the codebook, we find

$$d_{\min}^{(i)}(q) = \min_{j}\, d\!\left(F(q), F_j^{(i)}\right) \qquad (4.2.4)$$

and

$$D^{(i)} = \sum_{q=1}^{Q} d_{\min}^{(i)}(q) \qquad (4.2.5)$$

The value $D^{(i)}$ is calculated for all digits $i = 1, 2, \ldots, I$ and the decision rule is:
the recognized digit is $k$ if

$$D^{(k)} = \min_{i}\left(D^{(i)}\right) \qquad (4.2.6)$$

where the $k$th codebook is for digit $k$.
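
The testing mode can be sketched as follows: each frame is matched to its nearest codeword in every digit codebook, the per-frame minimum distances are accumulated over the utterance, and the digit with the smallest accumulated distance is chosen. The distance is written as a quadratic form with an inverse covariance matrix `inv_cov`; how that matrix is estimated is left outside the sketch, and it defaults to the identity purely for illustration. The function name `recognize_digit` is hypothetical.

```python
import numpy as np

def recognize_digit(frames, codebooks, inv_cov=None):
    """Return (index of the recognized digit, accumulated distance per digit)."""
    frames = np.asarray(frames, dtype=float)
    dim = frames.shape[1]
    # Assumption: identity weighting when no inverse covariance is supplied
    inv_cov = np.eye(dim) if inv_cov is None else np.asarray(inv_cov, dtype=float)
    scores = []
    for codebook in codebooks:
        codebook = np.asarray(codebook, dtype=float)
        diff = frames[:, None, :] - codebook[None, :, :]        # (Q, M, dim)
        d = np.einsum("qjx,xy,qjy->qj", diff, inv_cov, diff)     # (Q, M)
        scores.append(d.min(axis=1).sum())   # sum over frames of the minimum
    return int(np.argmin(scores)), np.array(scores)
```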
4.3 RESULTS


To evaluate the accuracy of the digit recognizer using the different front end processors, a
series of digit recognition experiments was performed. The measure of performance
was the overall recognition accuracy.

Three experiments were run on the databases RS and TS. The experiments were as
follows:

1. Experiment no. 1: No transitional spectral information was included while performing
this experiment. Tables 4.1 to 4.6 tabulate the recognition accuracy of the system using the
different front end processors. Noisy speech at a signal to noise ratio of 15 dB was used.



                     TESTING
                  CLEAN    NOISY
TRAINING  CLEAN   92.53    16.00
          NOISY   43.46    92.80

Table 4.1 LPC, WTI, TRAIN RS, TEST RS



                     TESTING
                  CLEAN    NOISY
TRAINING  CLEAN   95.60    10.53
          NOISY   10.26    89.73

Table 4.2 MEL, WTI, TRAIN RS, TEST RS


                     TESTING
                  CLEAN    NOISY
TRAINING  CLEAN   92.13    30.40
          NOISY   11.33    89.46

Table 4.3 SCALE, WTI, TRAIN RS, TEST RS



                     TESTING
                  CLEAN    NOISY
TRAINING  CLEAN   85.46    13.86
          NOISY   36.00    85.06

Table 4.4 LPC, WTI, TRAIN RS, TEST TS



                     TESTING
                  CLEAN    NOISY
TRAINING  CLEAN   82.93    10.26
          NOISY   10.13    78.80

Table 4.5 MEL, WTI, TRAIN RS, TEST TS



                     TESTING
                  CLEAN    NOISY
TRAINING  CLEAN   82.53    28.00
          NOISY   10.53    79.73

Table 4.6 SCALE, WTI, TRAIN RS, TEST TS


2. Experiment no. 2: In this experiment the transitional spectral information was included,
as suggested by Furui [14], giving a total of 24 coefficients in each feature vector. Noisy
speech at a signal to noise ratio of 15 dB was used. Tables 4.7 to 4.12 show the recognition
accuracy.



                     TESTING
                  CLEAN    NOISY
TRAINING  CLEAN   98.00    10.13
          NOISY   82.00    98.00

Table 4.7 LPC, ITI, TRAIN RS, TEST RS



                     TESTING
                  CLEAN    NOISY
TRAINING  CLEAN   98.26    49.06
          NOISY   12.66    98.00

Table 4.8 MEL, ITI, TRAIN RS, TEST RS



                     TESTING
                  CLEAN    NOISY
TRAINING  CLEAN   97.06    66.13
          NOISY   16.00    96.13

Table 4.9 SCALE, ITI, TRAIN RS, TEST RS



                     TESTING
                  CLEAN    NOISY
TRAINING  CLEAN   94.26    10.26
          NOISY   76.00    94.13

Table 4.10 LPC, ITI, TRAIN RS, TEST TS


                     TESTING
                  CLEAN    NOISY
TRAINING  CLEAN   95.33    36.00
          NOISY   11.33    92.66

Table 4.11 MEL, ITI, TRAIN RS, TEST TS



                     TESTING
                  CLEAN    NOISY
TRAINING  CLEAN   94.93    59.86
          NOISY   13.86    91.86

Table 4.12 SCALE, ITI, TRAIN RS, TEST TS


3. Experiment no. 3: The performance of the recognition system as a function of signal
to noise ratio using the different front end processors is evaluated. The training, or generation
of codewords, is always done on the RS database and testing is done on both the RS and TS
databases. Table 4.13 and Figure 4.2 show the results when the recognizer is tested on the
RS database. Table 4.14 and Figure 4.3 show the same for the TS database. Under warp2
we have repeated the experiment only for clean speech and 15 dB SNR, and the recognition
rates obtained are 85.20 and 74.8 for the RS database and 72.o3 and 63.46 for the TS database,
respectively. This experiment mimics a practical speaker independent speech recognition
environment. It is generally difficult to know in advance the noise environment in which the
test utterance will be spoken. Therefore the training is usually done under clean
conditions and the features have to be robust to variations in environmental noise.
Further, in practice the test utterance will be spoken by a person other than the trainer, and the
features have to be robust to inter-speaker variations.

SNR      LPC      MEL      SCALE
CLEAN    98.00    98.26    97.06
30 dB    40.93    96.13    93.06
24 dB    16.00    86.93    89.73
18 dB    10.53    65.06    78.53
15 dB    10.13    49.06    66.13
12 dB    10.13    36.53    50.00
6 dB     10.00    17.33    29.33

Table 4.13 RECOGNITION RATE vs SNR, TRAIN (CLEAN CONDITION, RS), TEST RS


SNR      LPC      MEL      SCALE
CLEAN    94.26    95.33    94.93
30 dB    28.93    89.73    90.13
24 dB    14.13    76.53    84.40
18 dB    11.20    51.33    70.66
15 dB    10.26    36.00    59.86
12 dB    10.13    27.06    43.46
6 dB     10.00    15.33    26.66

Table 4.14 RECOGNITION RATE vs SNR, TRAIN (CLEAN CONDITION, RS), TEST TS


Figure 4.2 RECOGNITION RATE vs SNR, TRAIN (CLEAN CONDITION, RS), TEST RS



Figure 4.3 RECOGNITION RATE vs SNR, TRAIN (CLEAN CONDITION, RS), TEST TS


The results of Tables 4.1 to 4.14 and Figures 4.2 and 4.3 show the following:

1. The performance of the LPC front end processor falls much more rapidly with
decreasing SNR than that of the mel scale and scale cepstrum based front end
processors.

2. The use of the transitional spectral information helps to improve the recognition
performance significantly. There is an improvement of approximately 5% in the
recognition accuracy.

3. Even at a signal to noise ratio of 15 dB (which resembles speech in a very noisy
environment) the recognition system using the scale cepstrum based front end processor
significantly outperforms the mel scale and LPC based front end processors. The
difference is much more evident in the case of the TS database, where the testing set is
completely different from the training set, which is the most necessary condition for a
speaker independent recognition system.

5 CONCLUSIONS AND FUTURE WORK


5.1 CONCLUSIONS

Although linear prediction (LP) based processors are popular in commercial speech
recognizers, recent studies have shown improved performance using mel based features,
which have now almost become the industry standard. A new front end processor based
on scale transform features has been proposed by Umesh et al. [5] for speaker
independent speech recognition. In this thesis we have studied the performance of these
three front end processors in terms of their classification performance for vowels and
isolated digits when there is a mismatch between train and test conditions in terms of
both the speakers and the noise environments.

From the tables in Chapter 3 we observe that, for the practical case of noisy speech,
the scale cepstral features are comparable to mel cepstral features and significantly better
than LPC features for vowel classification. It can also be observed that the Warp2 method of
scale cepstrum performs the best among all the three methods.

In Chapter 4 we compare the performance of an isolated digit recognizer based on
vector quantization ideas using these three front end processors. Experiment 3 of that
chapter mimics a practical situation where we train under clean conditions and test under
noisy conditions (since the noise environment under testing is not known in advance).
Under these conditions the scale based features (both the Warp1 and Warp2 methods)
significantly outperform the mel based and LPC based features.


5.2 FUTURE WORK

The results in this thesis indicate superior performance of scale cepstral features when
compared to the conventional features. The Warp1 method of scale cepstrum has only
recently been experimented with, and further studies need to be done to determine
empirical rules such as the number of cepstral coefficients to be used in the feature
vector, the sampling rate to be used in the spectral domain, and the type of liftering
window, if any, to be used on the scale cepstral coefficients. These empirical rules are
necessary to optimize the tradeoffs between recognition accuracy and parsimony of
feature space.